Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File #1914

Open
wants to merge 5 commits into
base: master


@SelfOnTheShelf SelfOnTheShelf commented Aug 18, 2021

Category

This change is exactly one of the following (please change [ ] to [x]) to indicate which:

  • a bug fix (Fix #...)
  • a new Ripper
  • a refactoring
  • a style change/fix
  • a new feature

Description

This feature adds support for using Redis as the mechanism for skipping already downloaded URLs. If you use RipMe over a long period to download many galleries and albums, the url_history.txt file gets quite large, and doing an O(n) scan through the entire list for every URL in a job becomes VERY expensive. My own url_history.txt file is approaching 3 million lines and 130 MB. Using Redis speeds up the ripping process considerably AND lets power users coordinate jobs running across multiple machines on a network.

Users can optionally add the following lines to the rip.properties file:

# IP address or domain name of the Redis host
url_history.redis_cache.host = 192.168.0.123
# Redis port (optional; RipMe defaults to 6379)
url_history.redis_cache.port = 6379
# Prefix for the keys added to Redis (optional; defaults to an empty string)
url_history.redis_cache.key_prefix = RipMeURL:
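
As a rough illustration of how these three settings and their stated defaults fit together, here is a minimal sketch. It is not the code from this PR: the class name RedisCacheSettings and its methods are hypothetical, and it reads the values with plain java.util.Properties rather than RipMe's own configuration helper. Note that java.util.Properties only treats # as a comment at the start of a line, so the sample input keeps comments out of the values.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the three url_history.redis_cache.* settings.
public class RedisCacheSettings {
    public final String host;      // null or empty means Redis is not configured
    public final int port;         // falls back to 6379 when absent
    public final String keyPrefix; // falls back to "" when absent

    public RedisCacheSettings(String host, int port, String keyPrefix) {
        this.host = host;
        this.port = port;
        this.keyPrefix = keyPrefix;
    }

    // Reads the settings, applying the defaults described in the PR text.
    public static RedisCacheSettings from(Properties props) {
        String host = props.getProperty("url_history.redis_cache.host");
        String port = props.getProperty("url_history.redis_cache.port", "6379");
        String prefix = props.getProperty("url_history.redis_cache.key_prefix", "");
        return new RedisCacheSettings(
                host == null ? null : host.trim(),
                Integer.parseInt(port.trim()),
                prefix.trim());
    }

    // Without a host, the caller would fall back to the file-based matching.
    public boolean isEnabled() {
        return host != null && !host.isEmpty();
    }

    // The Redis key for a URL is simply the prefix followed by the URL.
    public String keyFor(String url) {
        return keyPrefix + url;
    }

    public static void main(String[] args) throws IOException {
        String sample =
                "url_history.redis_cache.host = 192.168.0.123\n"
              + "url_history.redis_cache.key_prefix = RipMeURL:\n";
        Properties props = new Properties();
        props.load(new StringReader(sample));

        RedisCacheSettings settings = RedisCacheSettings.from(props);
        System.out.println("enabled: " + settings.isEnabled());
        System.out.println("port: " + settings.port);
        System.out.println("key: " + settings.keyFor("https://example.com/a.jpg"));
    }
}
```

A key prefix is mainly useful when the Redis instance is shared with other applications, since it keeps RipMe's keys easy to identify and to clean up.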

If users do not add this configuration, the URL matching algorithm now uses a HashSet built from url_history.txt. This is memory-intensive, but faster than the sequential scan.
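
The idea behind the HashSet fallback can be sketched as follows. This is an illustration under assumptions, not the PR's actual implementation: the class UrlHistoryIndex and its method names are hypothetical, and the history file is assumed to hold one URL per line. The file is read once into a set, after which each lookup is an expected O(1) hash probe instead of a scan over every line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical in-memory index over a url_history.txt-style file.
public class UrlHistoryIndex {
    private final Set<String> seen = new HashSet<>();

    // Reads the whole file once; blank lines are ignored.
    public static UrlHistoryIndex load(Path historyFile) throws IOException {
        UrlHistoryIndex index = new UrlHistoryIndex();
        if (Files.exists(historyFile)) {
            try (Stream<String> lines = Files.lines(historyFile, StandardCharsets.UTF_8)) {
                lines.map(String::trim)
                     .filter(line -> !line.isEmpty())
                     .forEach(index.seen::add);
            }
        }
        return index;
    }

    // Expected constant-time membership check.
    public boolean hasDownloaded(String url) {
        return seen.contains(url.trim());
    }

    // Records a URL; returns false if it was already present.
    public boolean markDownloaded(String url) {
        return seen.add(url.trim());
    }

    public int size() {
        return seen.size();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("url_history", ".txt");
        Files.write(tmp, Arrays.asList(
                "https://example.com/a.jpg",
                "https://example.com/b.jpg"), StandardCharsets.UTF_8);

        UrlHistoryIndex index = UrlHistoryIndex.load(tmp);
        System.out.println(index.hasDownloaded("https://example.com/a.jpg"));
        System.out.println(index.hasDownloaded("https://example.com/c.jpg"));
        Files.delete(tmp);
    }
}
```

The trade-off is the one noted above: every URL string stays resident in memory for the lifetime of the process, which for a history file with millions of lines can amount to several hundred megabytes of heap.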

Note: RipMe will continue to append new lines to the url_history.txt file, since this operation does not seem to slow down the job (at least at the scales that I have encountered).

Note 2: The easiest way to run Redis locally is with Docker (something like docker run --name my-redis -d -p 6379:6379 redis). Alternatively, you could download and install Redis for your OS.

Testing

Required verification:

  • I've verified that there are no regressions in mvn test (there are no new failures or errors).
  • I've verified that this change works as intended.
    • Downloads all relevant content.
    • Downloads content from multiple pages (as necessary or appropriate).
    • Saves content at reasonable file names (e.g. page titles or content IDs) to help easily browse downloaded content.
  • I've verified that this change did not break existing functionality (especially in the Ripper I modified).

Optional but recommended:

  • I've added a unit test to cover my change.

@SelfOnTheShelf SelfOnTheShelf changed the title Power Users Can Use Redis to Filter Already Downloaded URLs Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching Aug 19, 2021
@SelfOnTheShelf SelfOnTheShelf changed the title Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching Power Users Can Use Redis to Filter Already Downloaded URLs, Use HashSet for URL Matching From File Aug 20, 2021
@soloturn
Contributor

what a cool pull request. not that i'd ever need it, but the principle is a great showcase :) i tried to merge it here: https://github.com/ripmeapp2/ripme , but then wondered how to see within a couple of seconds, now and in the future, whether it works. would you mind adding a tiny unit test, maybe along the lines of:
https://www.baeldung.com/spring-embedded-redis

@SelfOnTheShelf
Author

@soloturn I've added some tests for this. Please let me know if you want me to change anything, especially regarding style. I'm both new to this codebase and the Java world in general!

@MarcoBorrini99

@soloturn I've added some tests for this. Please let me know if you want me to change anything, especially regarding style. I'm both new to this codebase and the Java world in general!

It seems to work; the only "downside" is that the project seems to be abandoned.

@soloturn
Contributor

soloturn commented Oct 17, 2021

thank you @SelfOnTheShelf ! 3 tiny things, if you could adjust them please:

  • use the latest versions of your dependencies
  • add the dependencies to the build.gradle.kts file as well; ripme2 no longer has a Maven build, only Gradle
  • reorder the HashSet import alphabetically so that it merges into ripme2 without conflict

4 participants